An open-source framework for Rift Valley Fever forecasting

Overview of OpenRVFcast

The goal of EcoHealth Alliance’s ongoing OpenRVFcast project the development of a generalizable, open-source modeling framework for predicting Rift Valley Fever (RVF) outbreaks in Africa, funded by the Wellcome Trust’s climate-sensitive infectious disease modeling initiative. We aim to integrate open data sets of climatic and vegetation data with internationally-reported outbreak data to build an modeling pipeline that can be adapted to varying local conditions in RVF-prone regions across the continent.

This project is a collaborative effort between EcoHealth Alliance, [INSERT PARTNER LINKS]

Pipeline Structure

The project pipeline is organized into two distinct modules: 1) the Data Acquisition Module and 2) the Modeling Framework Module. Both modules are orchestrated using the targets package in R, a powerful tool for creating reproducible and efficient data analysis workflows. By defining a workflow of interdependent tasks, known as ‘targets’, this package ensures that each step in the workflow is only executed when its inputs or code change, thereby optimizing computational efficiency. A modular, scalable, and transparent design makes targets an ideal choice for managing pipelines in reproducible research and production environments. An introduction to workflow management using targets can be found here. This project also uses the {renv} framework to track R package dependencies and versions which are recorded in the renv.lock file. Code used to manage dependencies is in renv/ and other files in the root project directory. On starting an R session in the working directory, run `renv::hydrate() and renv::restore() to install required R packags and dependencies.

Repository Structure

Project code is available on the open-rvfcast GitHub repository which is organized with the following structure:

Data Storage

We utilized parquet files and the arrow package in R as our primary method of storing data. Parquet files are optimized for high-performance, out-of-memory data processing, making it well-suited for efficiently handling and processing large, complex datasets. Additionally, arrow::open_dataset() supports seamless integration with cloud storage, enabling direct access to remote datasets, which improves workflow efficiency and scalability when working with large, distributed data sources. While the data acquisition module requires the processing of large datasets, the final cleaned data can be accessed directly from the cloud by opening the following connection:

dataset <- open_dataset("s3://open-rvfcast/data/africa_full_model_data")
dataset$schema

As parquet files are a columnar format with structured metadata available in each file, some operations, such as filtering, summarizing, and inspecting the data schema can be applied directly to remote datasets without having to first download the full data. Calling collect() on the dataset will initiate the download. For example, the following will filter the data and then download the model data for a single day:

dataset <- open_dataset("s3://open-rvfcast/data/africa_full_model_data") |> 
filter(date == "2023-12-14") |> 
collect()

However, due to computational demands of such large data, the model analysis pipeline will download the data in entirety before analysis. In addition, the dataset has been subsetted to two randomly chosen days per month between 2007 and 2024.

1. Data Acquisition Module

Cloud Storage

Many of the computational steps in the first module can be time consuming and either depend on or produce large files. In order to speed up the pipeline, intermediate files can be stored on the cloud for portability. We currently use an AWS S3 bucket for this purpose. The pipeline will still run without access to cloud storage, but users can add their own AWS access keys and bucket ID to the .env file to enable cloud storage.

Environment variables to add to the .env file:

AWS_DEFAULT_REGION=
AWS_REGION=
AWS_BUCKET_ID=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

Data Access

Acquiring the raw source data stores involves first obtaining authentication credentials, such as API keys, tokens, and certificates. There are three primary sources of data that require access credentials 1. ECMWF: for accessing monthly weather forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF). 2. COPERNICUS: for accessing Normalized Difference Vegetation Index (NDVI) data derived from the European Space Agency’s Sentinel-3 satellite. 3. APPEEARS: for accessing historical NDVI data prior to the Sentinel-3 mission from NASA MODIS satellites.

Before running the data acquisition pipeline, credentials for all three sources must be added to the .env file

Environment variables to add to the .env file:

ECMWF_USERID=
ECMWF_TOKEN=
COPERNICUS_USERNAME=
COPERNICUS_PASSWORD=
APPEEARS_USERNAME=
APPEEARS_PASSWORD=
APPEEARS_TOKEN=

Data Sources

All spatial data were interpolated to a resolution of 0.1° across Africa and standardized to the WGS 84 coordinate reference system. All temporal data layers were joined by date.

If data files become corrupted they can be re-generated from the raw sources by setting the OVERWRITE_X flags to TRUE in the .env file. This will prevent the pipeline from first downloading the data on AWS, re-download and process the raw data from the original sources, and upload the processed files to AWS for future use. Note that, under normal use, these should always be set to FALSE. The pipeline will automatically download any missing data without having to change these settings. This is only to replace data that has already been downloaded and processed mainly for pipeline development purposes.

The Response Variable

The goal of this project is to evaluate the potential for an outbreak of Rift Valley fever (RVF) to occur across Africa. The model was trained against a binary variable representing whether or not an outbreak occurred at each spatial location 0-30 days, 30-60 days, 60-90 days, 90-120 days, and 120-150 days after every date. RVF outbreak data was provided by the World Animal Health Information System (WOAH) and accessed via a database of cleaned outbreak data managed by EcoHealth Alliance.

  1. RVF_occurance: A binary factor reflecting RVF occurance at each location across the 5 forecast intervals.

Static Data

The following data sources are static, or time-invariant. Raw static data was downloaded from the linked sources and joined with dynamic data, such as temperature, which varied by day.

  1. Soil types: Soil types based on the Food and Agriculture Organization of the United Nations (FAO) Harmonized World Soil Database v2.0 (HWSD) with soil types aggregated into 8 categories: clay (heavy) + clay loam (1), silt loam + silty clay (2), sandy clay + clay (3), loam + silty clay loam (4), sandy clay loam (5), sandy loam + silt (6), loamy sand + silt loam (7), and sand (8) based on similarity in the USDA sand-silt-clay ternary texture class diagram (Figure 2). Data was aggregated by identifying the most common slope or aspect within each 0.1 degree grid cell.
  2. Slope and Aspect: Slope and aspect data from the FAO Global Terrain Slope and Aspect
  3. Gridded Livestock of the World 3 (GLW3): Global distribution data included cattle, sheep, and goats censused in 2010 and available at a native resolution of 5 arc-minutes. Data was accessed via the Harvard dataverse.
  4. Elevation: Elevation data accessed via the elevation_global() function of the geodata package in R, drawn from the Shuttle Radar Topography Mission (SRTM) at resolution of 0.5 minutes of a degree.
  5. Bioclimatic data*: Bioclimactic data from the WorldClim version 2.1 accessed via the worldclim_global() function of the geodata package in R and represent the global mean values across the period of 1970-2000 at a 2.5m resolution.
  6. Landcover type: Landcover data was accessed via the landcover() function of geodata package in R, drawn from the ESA WorldCover Database with a spatial resolution of 30 arc-seconds. Values for each landcover type (trees, grassland, shrubs, cropland, built, bare, snow, water, wetland, mangroves, and moss), reflect the fraction of each a landcover class at each location.

* Bioclimactic variables included: Annual_Mean_Temperature, Mean_Diurnal_Range, Isothermality, Temperature_Seasonality, Max_Temperature_of_Warmest_Month, Min_Temperature_of_Coldest_Month, Temperature_Annual_Range, Mean_Temperature_of_Wettest_Quarter, Mean_Temperature_of_Driest_Quarter, Mean_Temperature_of_Warmest_Quarter, Mean_Temperature_of_Coldest_Quarter, Annual_Precipitation, Precipitation_of_Wettest_Month, Precipitation_of_Driest_Month, Precipitation_Seasonality, Precipitation_of_Wettest_Quarter, Precipitation_of_Driest_Quarter, Precipitation_of_Warmest_Quarter, and Precipitation_of_Coldest_Quarter

Dynamic Data

Dynamic data sources are those that vary with time. Dynamic predictors can be highly conflated with each other due to a shared dependence on time, to account for this shared dependence, we used calculated the anomaly, or difference between current values and historical means, instead of using the raw values. Anomalies were calculated by first determining the difference between the current value and its historical mean for that day-of-year (DOY) and scaled by dividing by the standard deviation for that DOY. Focusing on anomalous values helped mitigate the strong correlation with time that naturally exists in environmental variables like temperature and NDVI. Seasonality was then accounted for by including year and day-of-year (DOY) as predictors in the model. The following sources make up the dynamic layers:

  1. weather_anomalies: NASA weather data was acquired across Africa using the get_power() function of the nasapower package in R which provides access to NASA meteorological data from the NASAPOWER project. The difference, or anomaly value, was then found by subtracting each weather value from the average value for that day-of-year (DOY).
  2. ndvi_anomalies: NDVI data was sourced from both the NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) and the European Space Agency’s Copernicus Sentinel-3 missions. MODIS is due to be retired in 2025 while Sentinel-3 NDVI data is available from September 2018. MODIS and Sentinel-3 NDVI values were interpolated to a daily interval from their native 16 day (MODIS) and ~10 day (Sentinel-3) intervals using a step-function and NDVI averaged when data from both sources were available. The difference, or anomaly value, was then found by subtracting NDVI from the average value for that day-of-year (DOY).
Weather Forecasts
  1. ecmwf_forecasts We also included long-range projections of future weather provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) and accessed through the Copernicus Climate Data Store (CDS). The projected data represent the mean of a 51-member ensemble and include the expected average temperature, precipitation, and relative humidity for each location across different forecast intervals. Historical forecasts were available through hindcasts, which apply the current forecasting methods to historical data to simulate what forecasts would have been available at those times based on past conditions.
Lagged Dynamic Data

Outbreak occurrence is not always directly influenced by the immediately preceding conditions. Biological systems often involve delayed responses. For example, heavy precipitation may promote a mosquito hatch, which can lead to an outbreak only after a delay. To account for the influence of past environmental conditions, we included lagged weather and NDVI data, specifically the average values from 0-30, 30-60, 60-90, 90-120, and 120-150 days prior.

  1. weather_anomalies: Average weather anomaly values lagged over the previous 1-5 months
  2. ndvi_anomalies_lagged: Average NDVI anomaly values lagged over the previous 1-5 months
Historical Outbreak Data

An important factor in evaluating the potential for a future outbreak is the history of outbreaks in a region. Recent nearby outbreaks can amplify the likelihood of an outbreak occurring at a given location, while older outbreaks might reduce the risk by influencing the resistance landscape, reflecting a history of prior exposure to the disease.

To account for the influence of outbreak history, we generated outbreak exposure weights for both recent and historical outbreaks. These weights were determined using a function that decreases with distance from the source, modeling exposure as declining exponentially outward to a maximum distance of 500 km with an exponential rate of decay of 0.01km-1. Similarly, the effects of an outbreak were assumed to fade over time, with influence declining as time elapsed since the outbreak increased out to a maximum of 10 years at an exponential rate of decay of 0.5year-1. Outbreaks that occurred within the last 3 months were classified as ‘recent’ and included as a separate predictor in the model allowing them to have a different effect on the model outcome compared to the older outbreak exposures.

  1. outbreak_history: Outbreak history was calculated using the data provided from same data described in the response section (item 1) above. As outbreak history contains information about the state of variable being predicted, special care was taken when splitting the data into test and training datasets to prevent data leakage described further below.

Targets Pipeline

A visualization of the data acquisition module can be found below. Additional targets not shown are responsible for fetching and storing intermediate datasets on the cloud. To run the data acquisition module, download the repository from github and run the following command. Note, without access to the common S3 bucket store this pipeline will take a significant amount of time and space to run. In addition, without access to the remote data store, the data acquisition module must be run before running the modeling module.

tar_make(script = "data_acquisition_targets.R")

The schematic figure below summarizes the steps of the data acquisition module. The figure is generated using mermaid.js syntax and should display as a graph on GitHub. It can also be viewed by pasting the code into https://mermaid.live.)

2. Rift Valley Fever (RVF) risk model pipeline

Data Partitioning

Splitting data into training, validation, and test sets is an important step for building robust and reliable models. The training set is used to learn model parameters, the validation set helps fine-tune hyperparameters and prevent overfitting, and the test set provides an unbiased evaluation of the model’s performance on unseen data. Proper splitting ensures the model generalizes well to new data, avoiding issues like data leakage or over-optimistic performance estimates.

However, splitting outbreak data can be particularly challenging due to spatial and temporal clustering, which can lead to imbalanced or non-representative splits. Ensuring that all three splits contain representative data, including both outbreak presence and absence, is critical for robust model evaluation.

Spatial splitting

Spatial splitting was accomplished by spatial blocking using the spatial_block_cv() function of the spatialsample to create spatial cross-validation folds. This ensures that each split contains distinct spatial regions, at the level of municipality that contain representive information in all three splits.

Temporal splitting

In addition to spatial clustering, outbreak data is time-series by nature, necessitating techniques like expanding window splitting where the training set grows incrementally over time as more data becomes available. This approach is particularly suited for scenarios where temporal dependencies exist, and models must be evaluated on their ability to generalize to future, unseen data. When outbreaks are rare, subdividing the limited positive detections can exacerbate the imbalance, making it harder to accurately assess the model’s performance and generalizability.

Model Structure

Evaluating Model Performance

Generating Dynamic Documentation and Reports

Targets Pipeline

A visualization of the data acquisition module can be found below.

tar_make(script = "model_framework_targets.R")

The schematic figure below summarizes the steps of the data acquisition module. The figure is generated using mermaid.js syntax and should display as a graph on GitHub. It can also be viewed by pasting the code into https://mermaid.live.)

Waywiser

Follow the links for more information about: